Introduction to R is brought to you by the Centre for the Analysis of Genome Evolution & Function (CAGEF) bioinformatics training initiative. This course was developed based on feedback on the needs and interests of the Department of Cell & Systems Biology and the Department of Ecology and Evolutionary Biology.
The structure of this course is a code-along style; it is 100% hands on! A few hours prior to each lecture, links to the materials will be available for download on Quercus. The teaching materials will consist of a Jupyter Lab notebook with concepts, comments, instructions, and blank spaces that you will fill out with R by coding along with the instructor. Other teaching materials include an HTML version of the notebook and datasets to import into R when required. This learning approach will allow you to spend your time coding rather than taking notes!
As we go along, there will be some in-class challenge questions for you to solve either individually or in cooperation with your peers. Post-lecture assessments will also be available (see syllabus for grading scheme and percentages of the final mark) through DataCamp to help cement and/or extend what you learn each week.
We'll take a blank-slate approach to R and assume that you know essentially nothing about programming. From the beginning of this course to the end, we want to take you from one of these potential scenarios:
You have a pile of data (like an Excel file or tab-separated file) full of experimental observations and you don't know what to do with it.
Maybe you're manipulating large tables entirely in Excel, making custom formulas and pivot tables with graphs. Now you have to repeat similar experiments and do the analysis all over again.
You're generating high-throughput data and there aren't any bioinformaticians around to help you sort it out.
You heard about R and what it could do for your data analysis but don't know what that means or where to start.
and get you to a point where you can:
Format your data correctly for analysis
Produce basic plots and perform exploratory analysis
Make functions and scripts for re-analysing existing or new data sets
Track your experiments in a digital notebook like Jupyter!
In the first two lessons, we will talk about the basic data structures and objects in R, get cozy with the RStudio environment, and learn how to get help when you are stuck. Because everyone gets stuck - a lot! Then you will learn how to get your data in and out of R, how to tidy your data (data wrangling), subset and merge data, and generate descriptive statistics. Next will be data cleaning and string manipulation; this is really the battleground of coding - getting your data into the format where you can analyse it. After that, we will make all sorts of plots for both data exploration and publication. Lastly, we will learn to write customized functions and apply more advanced statistical tests, which can really save you time and help scale up your analyses.
The structure of the class is a code-along style: it is fully hands on. At the end of each lecture, the complete notes will be made available in PDF format through the corresponding Quercus module, so you don't have to spend your attention taking notes.
This is the third in a series of seven lectures. Last lecture we focused on manipulating data frames with the help of the dplyr package. This week we'll dig deeper into the tidyverse and how we can clean up our data. At the end of this session you will be familiar with the principles of tidy data; subsetting and transforming your data to perform simple calculations; and converting your data between wide and long formats. Our topics are broken into:
Grey background: command-line code, R library and function names... fill in the code here if you are coding along.

Each week, new lesson files will appear within your JupyterHub folders. We are pulling from a GitHub repository using this Repository git-pull link. Simply click on the link and it will take you to the University of Toronto JupyterHub. You will need your UTORid credentials to complete the login process. From there you will find each week's lecture files in the directory /2021-09-IntroR/Lecture_XX, including a partially coded skeleton.ipynb file as well as all of the data files necessary to run the week's lecture.
Alternatively, you can download the Jupyter Notebook (.ipynb) and data files from JupyterHub to your personal computer if you would like to run independently of the JupyterHub.
A live lecture version will be available at camok.github.io that will update as the lecture progresses. Be sure to refresh to take a look if you get lost!
As mentioned above, at the end of each lecture there will be a completed version of the lecture code released as a PDF file under the Modules section of Quercus. A recorded version of the lecture will be made available through the University's MyMedia website and a link will be posted in the Discussion section of Quercus.
Sequencing of the V3-V5 hypervariable regions of the 16S rRNA gene
16S rRNA gene amplicon sequencing of 30 latrines from Tanzania and Vietnam at different depths (multiples of 20 cm). Microbial abundance is represented in Operational Taxonomic Units (OTUs): groups of organisms defined by a specified level of DNA sequence similarity at a marker gene (e.g. 97% similarity at the V4 hypervariable region of the 16S rRNA gene). Intrinsic environmental factors such as pH, temperature, and organic matter composition were also recorded.
An OTU-taxa table (sample naming convention: [Country_LatrineNo_Depth]) with sample names and environmental variables.
B Torondel, JHJ Ensink, O Gunvirusdu, UZ Ijaz, J Parkhill, F Abdelahi, V-A Nguyen, S Sudgen, W Gibson, AW Walker, and C Quince. Assessment of the influence of intrinsic environmental and geographical factors on the bacterial ecology of pit latrines. Microbial Biotechnology, 9(2):209-223, 2016. DOI:10.1111/1751-7915.12334
This dataset is the result of 16S rRNA gene amplicon sequencing of samples from microbial communities cultured in fresh, brackish, or saline media. Treatments received the aromatic compounds toluene or pyrene as the sole source of carbon and energy. Controls did not receive any compounds (substrate-free) to account for any alternative carbon sources present in the media. The objective of this experiment was to evaluate which microorganisms would make use of toluene and pyrene.
Taxa table with gene sequences (ASV) and sample information. This file must be converted from wide to long format before use.
This data comes from the gapminder package which has an excerpt of data from gapminder.org circa 2010. It is a dataset used for exploring data and formatting it.
This file covers variables related to country statistics tracking life expectancy, population size, and GDP per capita over years spanning from 1952 to 2007. We'll explore this dataset a little at the end of lecture.
The following packages are used in this lesson:
tidyverse (tidyverse installs several packages for you, like dplyr, readr, readxl, tibble, and ggplot2)
gapminder (a dataset used for exploring data formatting, with data pulled from gapminder.org circa 2010)

Open up the Anaconda prompt and install your packages of interest.
conda install - starting command to call the installer for anaconda.
-c conda-forge - look for the package in the 'conda-forge' channel.
r-packagename - the name of the package you're interested in installing.
Combine the parts into a single command like:
conda install -c conda-forge r-gapminder
#--------- Install packages for today's session ----------#
# install.packages("tidyverse", dependencies = TRUE) # This package should already be installed on Jupyter Hub
install.packages("gapminder", dependencies = TRUE)
#--------- Load packages for today's session ----------#
library(tidyverse)
library(gapminder)
Note that 8 different packages are loaded with the tidyverse, and that 2 functions from the stats package have been masked by functions of the same name from dplyr. You can still access the stats version of a masked function by calling it directly, as in stats::filter().
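For example, with the tidyverse loaded, a bare call to filter() resolves to the dplyr version, while the namespace prefix reaches the original stats version (shown here on built-in data, purely for illustration):

```r
library(dplyr)

# dplyr's filter(): subset the rows of a data frame
filter(mtcars, cyl == 6)

# stats' filter(): apply a linear filter to a sequence
# (here, a 3-point moving average)
stats::filter(1:10, rep(1/3, 3))
```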
Wide and long (sometimes called un-stacked and stacked, wide and tall, or wide and narrow) are terms used to describe how a dataset is formatted.
In wide format, the variables may be listed in the first column, each forming a row of its own, while the observations are presented as columns holding the observed values for each variable.

In long format, each variable is its own column, and each measured value is recorded in its own row. In data science, long format is preferred over wide format because it allows for easier and more efficient subsetting and manipulation of the data.
To read more about wide and long formats, visit https://eagereyes.org/basics/spreadsheet-thinking-vs-database-thinking.
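As a minimal sketch of the difference (using a small invented taxa table, not this week's dataset), the same counts can be laid out both ways:

```r
library(tibble)

# Hypothetical wide format: one row per taxon, one column per site
wide <- tribble(
  ~Taxa,        ~Site_A, ~Site_B,
  "Clostridia",      10,      25,
  "Bacilli",          3,       0
)

# The same data in long format: one row per Taxa/Site observation
long <- tribble(
  ~Taxa,        ~Site,    ~OTUs,
  "Clostridia", "Site_A",    10,
  "Clostridia", "Site_B",    25,
  "Bacilli",    "Site_A",     3,
  "Bacilli",    "Site_B",     0
)
```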
In this lesson we want to answer 3 simple questions:
Which latrine depth has the greatest mean number of OTUs? (remember the Country_LatrineNo_Depth site encoding)
Is there more Clostridia in Tanzania or Vietnam?
Which site had the greatest number of Taxa represented?
Last lesson, we learned how to filter and select data subsets we were interested in. However, we can make data manipulation more efficient by controlling the overall structure or format of our data.
Let's read in our dataset, store it in a variable, and remind ourselves about the original structure.
# Always check where you are first!
getwd()
list.files("...")
# Read in our file.
# Remember we're using the tidyverse to import our data!
data <- read_csv("...")
# Take a look at the head and structure of your data
head(data)
str(data, give.attr = FALSE) # Use the give.attr parameter to make it a little less verbose
# this is nearly equivalent now to glimpse(data)
To summarize, we see 52 rows of data, each with 82 columns.
Why tidy data?
Data cleaning (or dealing with 'messy' data, aka wrangling) accounts for a huge chunk of a data scientist's time. Ultimately, we want to get our data into a 'tidy' format (long format) where it is easy to manipulate, model and visualize. Having a consistent data structure and tools that work with that data structure can help this process along.
In Tidy data:
This seems pretty straightforward, and it is. The datasets you get, however, will not be straightforward. Having a map of where to take your data is helpful in unraveling its structure and getting it into a usable format.
The 5 most common problems with messy datasets are:
Fortunately there are some tools available to solve these problems.
The tidyverse is the universe of packages created by Hadley Wickham for data analysis. There are packages to help import, tidy, transform, model and visualize data. His packages are pretty popular, so he made a package to load all of his packages at once. This wrapper package is tidyverse. In this lesson series we have used dplyr, readr and readxl, and we will be using dplyr and tidyr today.
Hadley has a large fan-base. Someone even made a plot of Hadley using his own package, ggplot2.
Back to the normalverse...
# Let's review our data again
head(...)
Which tidy data rules might our data frame break?
At first glance we can see that the column names are actually 3 different variables: 'Country', 'LatrineNumber', and 'Depth'. This information will likely be useful in our study, as we expect different bacteria at different depths, sites, and geographical locations. Each of these is a variable and should have its own separate column.
We could keep the column names as the sample names (as they are meaningful to the researcher) and add the extra variable columns, or we could make up sample names (ie. Sample_1) knowing that the information is not being lost, but rather stored in a more useful format.
Some of the Taxa also appear to have an additional variable of information (ie. _Gp1), but not all taxa have this information. We could also make a separate column for this information.
Each result is the same observational unit (i.e. relative abundances of bacteria), so having one table is fine.
tidyr

tidyr is a package with functions that help us turn our 'messy' data into 'tidy' data. It has 2 major workhorse functions and 2 other tidying functions:
pivot_longer() - convert a data frame from wide to long format
pivot_wider() - convert a data frame from long to wide format
separate() - split a column into 2 or more columns based on a string separator
unite() - merge 2 or more columns into 1 column using a string separator

Note that pivot_longer() and pivot_wider() rely on unique key-value pairs to collapse or expand columns.
We've already loaded tidyverse which includes the tidyr package that the pivot_longer() function is from.
pivot_longer()

Previously called gather(), the updated pivot_longer() function is used to collect our columns in a straightforward way. As the name implies, this will pivot our dataset from a wide to a long format.
pivot_longer(
data,
cols,
names_to = "name",
names_prefix = NULL,
names_sep = NULL,
names_pattern = NULL,
names_ptypes = list(),
names_transform = list(),
names_repair = "check_unique",
values_to = "value",
values_drop_na = FALSE,
values_ptypes = list(),
values_transform = list(),
...
)
We won't be using all of these arguments from pivot_longer() but there are a few we'll highlight here:
data - our data frame (actually a tibble, as we mentioned last class, but close enough...)
cols - the set of columns we wish to pivot. For each observation, all the column names will be stored in a single column, and all of the values in another. As we'll see, this will greatly increase the number of rows in our data. There are many ways to define the parameter for this argument.
names_to - a string or character vector used to name the column(s) where we'll store the column names retrieved from the cols set.
names_sep - if names_to is a character vector, this controls how the column names are parsed (broken up).
values_to - the name of the column where we'll store the values retrieved from the cols set.

We've already imported taxa_pitlatrine_wide.csv and stored it as the variable data. Recall it is 52 rows (Taxa) and 82 columns, of which 81 are observations coded as Country_LatrineNumber_Depth.
We'll use pivot_longer() to collect all the 81 observation columns and convert them into observation rows. The column names will be stored in a new variable called Site and the values from these columns will be in a variable called OTUs. Any untransformed columns become additional identifying data for each observation.
For now, we'll just step through the formatting process without saving the result into a variable.
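Before we touch the course data, here is a minimal sketch of the same call on an invented two-taxon tibble (the column names and values are made up for illustration):

```r
library(tidyr)
library(tibble)

toy <- tibble(Taxa         = c("A", "B"),
              Tanzania_2_1 = c(5, 0),
              Vietnam_9_4  = c(2, 7))

# Pivot every column except Taxa; column names go to Site, values to OTUs
pivot_longer(toy, cols = -Taxa, names_to = "Site", values_to = "OTUs")
# Result: 4 rows, one per Taxa/Site combination
```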
# Remember we are piping "data" in as the first argument of pivot_longer
data %>%
# Use the pivot_longer command
pivot_longer(data = ., # Recall what the "." represents?
cols = ..., # The first column is Taxa and we don't need to pivot that
names_to = "...", # The variable where we'll store columns names
values_to = "...") %>% # The variable where we'll store column values
# Take a peek at the result
head()
cols in pivot_longer()

If we look at the documentation for the pivot_longer() function we see that the argument cols uses something called <tidy-select>. This is an argument modifier indicating that an additional layer of syntax can be used to select variables (columns) based on their names.
This formatting includes some selection helpers like:
everything() - matches all variables.
last_col() - selects the last variable (or an offset from it).
starts_with(), ends_with(), contains(), matches() - match variable names by prefix, suffix, substring, or regular expression.
all_of() - matches variable names in a character vector.

For now we'll use some simpler examples to replicate what we did above.
# equivalent to all on a single line
data %>% pivot_longer(data = ., cols = c(2:82), names_to = "Site", values_to = "OTUs") %>% head()
# equivalent to cutting the first column
data %>% pivot_longer(data = ., cols = ..., names_to = "Site", values_to = "OTUs") %>% head()
# equivalent to cutting the first column by name
data %>% pivot_longer(data = ., cols = ..., names_to = "Site", values_to = "OTUs") %>% head()
In the above examples -1 means pivot every column except the 1st, or pivot every column except "Taxa". Taxa is still retained as a column but its values are not grouped in with 'Site' as an observation (i.e. we do not want 'Vietnam_9_4', 'Tanzania_2_9', and 'Clostridia' pivoted into the same column).
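To see a couple of the tidy-select helpers in action on a hypothetical tibble (names invented for illustration):

```r
library(tidyr)
library(tibble)

toy <- tibble(id = 1:2, height_cm = c(170, 160), weight_kg = c(70, 55))

# Pivot every column except id (same idea as -Taxa above)
pivot_longer(toy, cols = !id, names_to = "measure", values_to = "value")

# Pivot only the columns whose names match a pattern
pivot_longer(toy, cols = matches("_(cm|kg)$"),
             names_to = "measure", values_to = "value")
```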
Let's save the last variation into a data frame called long_data.
# Assign a variable the result from
long_data ...
# passing data
data %>%
# to pivot_longer
pivot_longer(data = ., cols = -Taxa, names_to = "Site", values_to = "OTUs")
# Take a look at the result
head(long_data)
str(long_data)
pivot_longer()?

Note how the dimensions of your data frame have changed relative to data. Instead of 52 rows and 82 columns, we now have a data frame with 4,212 rows and 3 columns (the 81 columns we pivoted x 52 rows). long_data is now in long format instead of wide.
separate() can split variables into multiple columns by specifying a text delimiter

Notice the information contained in our "Site" data? It is a combination of 3 values: country, latrine number, and depth, all separated with "_".
We can use the separate() function to retrieve the Country, LatrineNumber, and Depth information from our Site variable. The separate() function takes in your data frame, the name of the column to be split, the names of your new columns, and the character that you want to split the columns by (in this case an underscore `_`). Note that the default is to remove your original column - you can keep it by adding the argument `remove = FALSE`, keeping in mind that you now have redundant data.
separate(
data,
col,
into,
sep = "[^[:alnum:]]+",
remove = TRUE,
convert = FALSE,
extra = "warn",
fill = "warn",
...
)
We need to provide separate() with information to help split our variable.
data - our data frame or tibble
col - the name or position of the column we want to split.
into - a character vector of the column names we want to split from argument col.
sep - tells separate() how to break up the information in each row of col.

# long_data
long_data %>%
# after splitting Site into 3 new columns
separate(data = ., col = ..., into = c("Country", "Latrine_Number", "..."), sep = ...) %>%
# Take a peek at the result
head()
#equivalent to setting the parameter "remove" to TRUE, although this is the default behaviour
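For reference, the same kind of split in isolation on invented data (not saved anywhere):

```r
library(tidyr)
library(tibble)

toy <- tibble(Site = c("Tanzania_2_1", "Vietnam_9_4"), OTUs = c(5, 2))

# Split Site on "_" into three new columns; Site itself is dropped by default
separate(toy, col = Site,
         into = c("Country", "Latrine_Number", "Depth"),
         sep = "_")
```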
convert parameter

Looking at our output from above, there's one little thing we missed that you'll have to consider when using functions like separate() yourself - do you want to automatically convert your split columns to the correct data type? The default behaviour of separate() is to leave your split columns as the character data type. You may, however, want it to try to guess more appropriate types; this could also introduce NA values, so beware! If you are splitting many variables, it may be easier to convert automatically and fix any inadvertent variable types afterwards.
For instance, we may want the variable Depth as a numeric for statistical analysis, whereas Latrine_Number seems more appropriate as a label. Let's try the logical convert parameter.
# long_data
long_data %>%
# after splitting Site into 3 new columns
separate(data = ., col = Site, into = c("Country", "Latrine_Number", "Depth"), sep = "_", convert = ...) %>%
# Take a peek
head()
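To see the effect of convert in isolation, compare the column types on a small invented tibble:

```r
library(tidyr)
library(tibble)

toy <- tibble(Site = c("Tanzania_2_1", "Vietnam_9_4"))

# Without convert: all three split columns remain character
str(separate(toy, Site, c("Country", "Latrine_Number", "Depth"), sep = "_"))

# With convert = TRUE: Latrine_Number and Depth are parsed as integers
str(separate(toy, Site, c("Country", "Latrine_Number", "Depth"),
             sep = "_", convert = TRUE))
```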
Let's fix that Latrine_Number variable while we are at it! We'll save the final form too.
# Save as a variable
split_long_data <-
# long_data
long_data %>%
# after splitting Site into 3 new columns
separate(data = ., col = Site, into = c("Country", "Latrine_Number", "Depth"), sep = "_", convert = TRUE) %>%
# Remember how to mutate/change columns?
mutate(., Latrine_Number = ...)
# Check the result
str(split_long_data)
separate() versus pivot_longer()

We just worked through an excellent example of how to use the separate() function, but in the case of splitting the Site variable into three new variables, we could have taken a different approach. Remember back in section 1.3.1 we mentioned the pivot_longer() argument names_sep. In this situation, the Site variable is derived from the column titles we used in pivot_longer().
With careful consideration, we could fold the separate() step into the pivot_longer() step, which can do an internal separate() call under the hood. This is obviously a common enough occurrence in the data science community that the feature was added to pivot_longer().
To use this feature, we just need to specify the new column information correctly in the names_to parameter along with our names_sep parameter.
# passing data
data %>%
# to pivot_longer
pivot_longer(data = .,
cols = -Taxa,
... = c("Country", "Latrine_Number", "Depth"),
names_sep = ...,
... = list(Latrine_Number = as.factor, Depth = as.integer), # Set the variable types manually here
values_to = "OTUs") %>%
head()
See? Two sets of code, but we get exactly the same result!
Now getting back to our data, we may also want to split the Taxa column based on the '_Gp' (Group) suffix seen on Acidobacteria and some other entries. We cannot do this step using pivot_longer(), since Taxa is an original variable untouched by the pivot_longer() process.
Let's try the code with separate() but do not save the answer in a variable.
# practice splitting "Taxa" into two columns: Taxa and Group using the separate command
split_long_data %>%
separate(Taxa, c("Taxa", "Group"), sep = ...) %>%
head(10)
Notice that we have triggered a warning from R that it has filled in 'NA' for the bacteria that did not have groups. Also note that we chose to split Taxa on '_Gp', since we did not need 'Gp' repeated in the Group values.
glimpse() and str() information for our split data

Use the glimpse() function to look at the type of each variable in our new data frame, split_long_data. Are those the types you expected? Why or why not? How is glimpse() different from the str() function?
print ("glimpse of our split data")
...(split_long_data)
# Compare our
print ("structure of our split data")
str(split_long_data)
group_by() to organize our dataset

Last lecture we learned a useful function, group_by(), that groups data based on one or more variables. This is useful for calculations and plotting on subsets of your data without having to turn your variables into factors.
Suppose we wanted to look at a combination of Country and Well Depth. While visually, you wouldn't notice any changes to your data frame, if you look at the structure it will now be a grouped_df. There are 15 groupings resulting from Country and Depth. After we have performed our desired operation, we can return our data frame to its original structure by calling ungroup().
First we will examine the structure of grouped and ungrouped output without any additional operations.
# Look at the structure of our data
print("split_long_data in native format")
str(split_long_data)
# Group and then look at the structure of our data
print("split_long_data in a grouped data frame")
split_long_data %>% group_by(...) %>% str()
Note the additional attribute called groups in our grouped data? What do you think the values stored here represent?
# Group and then ungroup our data
print("split_long_data grouped and then ungrouped")
str(split_long_data %>% group_by(Country, Depth) %>% ...)
group_by() and summarise() functions to produce sensible stat data

Recall last lecture we briefly used the group_by() and summarise() functions in combination to produce some preliminary statistics on our data. With this much larger set of data, this is a perfect time to revisit that combination of functions.
In our initial messy data frame it was difficult to do calculations based on Country, Latrine Number, or Well Depth; now this is an easy task. Let's get the mean, median, standard deviation, and maximum value for the number of OTUs collected in Tanzania vs Vietnam.
# group and then summarise data
split_long_data %>%
group_by(...) %>% # Group by the country
summarise(mean = ..., # Calculate some basic statistics from each group
median = ...,
sd = sd(OTUs),
maximum = max(OTUs))
In dealing with grouped data, we no longer have to grab a Country by subsetting or using helper functions to grab letters from their names. group_by() produces that modified data frame which recognizes that we have 2 countries and will perform calculations for both of them.
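The same group-then-summarise pattern works on any grouped variable; here is a quick sketch on the built-in mtcars data (unrelated to our latrines, purely for practice):

```r
library(dplyr)

# Mean and maximum miles-per-gallon for each cylinder count
mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg),
            max_mpg  = max(mpg)) %>%
  ungroup()
```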
Now that we have tidy data, let's proceed to answering our questions:
What steps do we need to take to answer this question? Before we dive in, let's consider what we need to answer this question.
# Group and then summarize the data before sorting in descending order
split_long_data %>%
group_by(...) %>% # group our data
summarise(mean = ...) %>% # summary stats
...(desc(mean)) # sort in descending order by the mean
Recall our original data set had three variables that were mashed together into a Site (Country_LatrineNo_Depth) and answering our question would have been difficult. With tidy data format we have split a combined variable column into three, making Latrine_Number a variable. This allowed us to simply group our data to perform our mean calculation and retrieve an answer.
This is a bit of variation on our first question. What kind of data do we need?
# Filter your data then group it before summarizing
split_long_data %>%
filter(...) %>% # filter the data based on Taxa
group_by(...) %>% # Why are we grouping with Taxa as well?
summarise(sum = sum(OTUs), # Summarise the OTU values
mean = mean(OTUs),
sd = sd(OTUs)) %>%
arrange(desc(sum)) # sort it all by descending total
Again, being able to filter by Taxa and group by Country (as an isolated variable) helps a lot. With dplyr syntax we can perform all data manipulations and calculations in a code block that is readable.
Note that we grouped by both Country and Taxa. This was done more for visual purposes to confirm that Clostridia was the only taxa coming out of our filter. We could easily have done group_by(Country) with the same result. Our final output, however, would not include the Taxa column.
Let's break down the question again into its basic components!
n() command from dplyr to count a group out.

split_long_data %>%
group_by(Country, ...) %>% # Group by the three variables
filter(OTUs != ...) %>% # Remove OTUs with a value of 0
summarise(count = ...) %>% # Count total observations per group
arrange(desc(count)) %>% # Sort by the count number
head(., 20) # Look at just the first 20 rows
Since we can group by the 3 variables that were in the Site name, there is no disadvantage to having our data in tidy format compared to our original wide data frame. However, now we are able to filter for non-zero OTUs, which was impossible in the wide format. Since we know from earlier in the lesson that each Taxa is only represented once for each site, we only have to count and order the number of observations to get our answer.
To get data back into its original format, there are reciprocal functions in the tidyr package, making it possible to switch between wide and long formats.
Fair question: But you've just been telling me how great the 'long' format is?!?! Why would I want the wide format again???
Honest answer: Note that our original data frame was 52 rows and expanded to 4212 rows in the long format. When you have, say, a genomics dataset you might end up with 6,000 rows expanding to 600,000 rows. You probably want to do your calculations and switch back to the more "human writeable/readable" format. Sure, I can save a data frame with 600,000 rows, but I can't really send it to anyone because spreadsheet software such as Excel might crash while trying to open the file.
unite() your columns back again

unite(
data,
col,
...,
sep = "_",
remove = TRUE,
na.rm = FALSE
)
The opposite of separate(), we need to provide unite() with information to help consolidate our information.
sep - tells unite() what kind of character to put between recombined data values going into col.

Let's turn back time and collapse Country, Latrine_Number and Depth back into one variable, 'Site', using the unite() function. Store the output in a data frame called united_data.
# Put our three variables Country, Latrine_Number, and Depth back together
united_data <- split_long_data %>%
unite(..., c("Country", "Latrine_Number", "Depth"),
sep = ...)
# Take a peek at the end result
head(united_data, 10)
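For reference, here is the same operation in isolation on an invented one-row tibble:

```r
library(tidyr)
library(tibble)

toy <- tibble(Country = "Tanzania", Latrine_Number = 2L, Depth = 1L)

# Glue the three columns back into a single Site column
unite(toy, col = Site, Country, Latrine_Number, Depth, sep = "_")
# Result: one Site column containing "Tanzania_2_1"
```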
pivot_wider() to convert your data from long to wide format

pivot_wider(
data,
id_cols = NULL,
names_from = name,
names_prefix = "",
names_sep = "_",
names_glue = NULL,
names_sort = FALSE,
names_repair = "check_unique",
values_from = value,
values_fill = NULL,
values_fn = NULL,
...
)
The opposite of pivot_longer(), we need to provide pivot_wider() with information to help consolidate our information. This can be tricky to conceptualize BUT the goal is to consolidate row entries based on specific columns that we do NOT name.
data - our data frame (aka tibble) that we wish to convert.
id_cols - the unique column(s) that will form the basis of identifying each observation. The default identifiers are any columns not used by names_from and values_from.
names_from - a column or multiple columns from which to pivot back to a wider format. These will become variable names whose values will come from...
values_from - pairs with the names_from parameter to fill the new variable columns formed by names_from.

Let's turn the hands of time back further and use the pivot_wider() function to turn united_data into the wide shape of our original dataset. Save the output into a data frame called wide_data.
# Remember that we still have 3 variables: Taxa, Site, OTUs. We want to reunite each site into a column of OTUs
wide_data <- united_data %>%
pivot_wider(data = .,
names_from = ...,
values_from = ...)
# Take a peek at the reformatted data
head(wide_data)
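And a minimal sketch of pivot_wider() on invented data, to make the key-value pairing concrete:

```r
library(tidyr)
library(tibble)

toy <- tibble(Taxa = c("A", "A", "B", "B"),
              Site = c("S1", "S2", "S1", "S2"),
              OTUs = c(5, 2, 0, 7))

# Site values become the new column names, filled from OTUs
pivot_wider(toy, names_from = Site, values_from = OTUs)
# Result: 2 rows (Taxa A and B) with columns Taxa, S1, S2
```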
We'll use the standard write_csv command to save our results to our data folder.
getwd()
write_csv(x = wide_data,
file = "...",
col_names = TRUE)
We've gone full circle from pivoting a wide format dataset to long format, splitting multivariable columns and summarizing on the observations, and reverting the whole thing back to its original form. Now it's time to practice a bit more.
united_data

Filter united_data to remove all zero values, and store it as united_data_no_zeros. Subset united_data_no_zeros to keep the top 20 rows with the highest OTUs.

united_data_no_zeros <-
united_data %>% # pass the united_data variable
filter(OTUs...) %>% # remove OTUs with a value of 0
arrange(desc(OTUs)) %>% # sort the data by descending OTU
.[...] # index just the first 20 rows
## Don't forget what that "." means. Kind of a short-hand for "this"
# Take a look at the result
united_data_no_zeros
united_data had a value of zero?

Let's alter our previous code to answer the question of just how many observations in united_data had a value of 0 in the OTUs variable.
united_data %>% # pass the united_data variable
filter(OTUs ... 0) %>% # keep only the OTUs with a value of 0
str(.)
microbes_wide.csv to long format

Reshape microbes_wide.csv into a long format version called microbes_long. Make sure the columns are in the following order: "ASV", "salinity", "compound", "group", "replicate", "kingdom", "phylum", "class", "order", "family", "genus", and "abundance".
Once you're done, write microbes_long as microbes_long.csv.
Notes about microbes_wide.csv:
# Import your data file and look at it
microbes_wide <- read_csv("...")
head(microbes_wide)
Examining our data, the first two columns hold sequence and taxa information; the remaining 28 hold our multivariable data. Let's begin by pivoting to long format before breaking up the multivariable columns sample_name and taxa.
# Proceed to break the different observations and information apart
microbes_long <-
# pass along the data
microbes_wide %>%
# convert to long format
pivot_longer(., cols = ..., names_to = "sample_name", values_to = "abundance") %>%
# break up the sample_name column
separate(col = sample_name,
into = c("salinity", "compound", "group", "..."),
sep = "_") %>%
# break up the taxa column
separate(col = taxa,
into = c("kingdom", "phylum", "class", "order", "family", "..."),
sep = "_")
# Take a peek at the resulting data frame
head(microbes_long)
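The pivot-then-separate pattern above can be sketched on a tiny invented table (toy_wide, with two made-up sample columns named in a salinity_compound style), so you can see what each step produces before running it on microbes_wide:

```r
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(tidyr))

# A tiny made-up wide table: one ID column plus multivariable sample columns
toy_wide <- tibble(ASV          = c("asv1", "asv2"),
                   low_glucose  = c(5, 0),
                   high_acetate = c(2, 9))

toy_long <-
  toy_wide %>%
  # stack every column except the ID into name/value pairs
  pivot_longer(cols = -ASV,
               names_to = "sample_name", values_to = "abundance") %>%
  # split the combined sample name on the underscore
  separate(col = sample_name,
           into = c("salinity", "compound"), sep = "_")

toy_long
```

Two ASVs times two sample columns yields four long-format rows, each with its salinity and compound pulled apart into their own variables.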
# Let's review the variable names
colnames(microbes_long)
Our columns are in the wrong order! We need to shuffle "salinity", "compound", "group", and "replicate" towards the left. Relocate columns to match this order: "ASV", "salinity", "compound", "group", "replicate", "kingdom", "phylum", "class", "order", "family", "genus", and "abundance".
# There are a few approaches we could take here, but indexing might be the easiest
microbes_long <- microbes_long[ , c(...)]
# recheck those column names
colnames(microbes_long)
Is there a dplyr function we could use instead? Take some time to think about it!
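If you want to check your thinking afterwards: one candidate is dplyr::relocate(), which moves columns by name instead of by numeric position. A minimal sketch on an invented data frame (toy and its columns are made up):

```r
suppressPackageStartupMessages(library(dplyr))

# A made-up data frame with columns out of order
toy <- tibble(ASV = "asv1", kingdom = "Bacteria",
              salinity = "low", abundance = 5)

# relocate() moves columns without spelling out every index
toy_ordered <- toy %>% relocate(salinity, .after = ASV)

colnames(toy_ordered)
```

Naming columns rather than indexing them means the code keeps working even if extra columns are added later.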
Now just write the data frame to file as microbes_long.csv.
getwd()
write_csv(microbes_long, file="...", col_names = TRUE)
Convert gapminder_wide to a long format¶Read in gapminder_wide.csv. What rules of tidy data does it break? Transform the dataset to the format below.
| continent | country | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| Asia | Afghanistan | 1952 | 28.801 | 8425333 | 779.4453 |
| Asia | Afghanistan | 1957 | 30.332 | 9240934 | 820.8530 |
| Asia | Afghanistan | 1962 | 31.997 | 10267083 | 853.1007 |
| Asia | Afghanistan | 1967 | 34.020 | 11537966 | 836.1971 |
| Asia | Afghanistan | 1972 | 36.088 | 13079460 | 739.9811 |
| Asia | Afghanistan | 1977 | 38.438 | 14880372 | 786.1134 |
Create gapminder_long¶Reshape gapminder_wide into a long, cleaned format. BEFORE you start writing any code, IDENTIFY the formatting deficiencies and then PLAN ahead what you want to achieve.
# read gapminder_wide.csv
gapminder_wide <- read_csv("...")
# Take a peek at the data
head(gapminder_wide)
# Look at the structure of our data
glimpse(gapminder_wide)
Observations appear to be split into visible categories based on continent and country (variables 1 and 2); the remaining columns are a combination of:
gdpPercap_year: the GDP per capita for a specific year in that continent/country combination
lifeExp_year: the life expectancy for a specific year in that continent/country combination
pop_year: the population for a specific year in that continent/country combination
So we are seeing multiple observations per column AND multiple observation types in the table (GDP, life expectancy, and population)! What a mess!
# Let's do a bulk conversion into long format
gapminder_long <-
gapminder_wide %>%
pivot_longer(., cols = -c("country", "continent"), # Note that this time we've excluded columns via the cols parameter
names_to = "...",
values_to = "...")
# Take a peek at the long-format result
head(gapminder_long)
We've generated a long-format table, but we still have obs_type as a variable holding two kinds of information: the observation type (gdpPercap, lifeExp, or pop) and the year.
We need to break that up into separate variables.
# We need to break the multiple observations from a column into multiple columns
# Divide obs_type into "obs_type" and "year"
gapminder_long <-
gapminder_long %>%
separate(obs_type, c(...), sep = "_", convert = TRUE)
# Take a peek
head(gapminder_long)
# What does our data frame look like now?
str(gapminder_long)
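The convert = TRUE argument is easy to overlook: it re-types the split pieces, so the year comes out numeric instead of as text. A sketch on an invented mini table (toy_long is made up in the gapminder_long mould):

```r
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(tidyr))

# Made-up long table: observation type and year fused in one column
toy_long <- tibble(country  = "Afghanistan",
                   obs_type = c("pop_1952", "pop_1957"),
                   value    = c(8425333, 9240934))

toy_split <-
  toy_long %>%
  # convert = TRUE turns the "year" piece into a number rather than leaving it as text
  separate(obs_type, c("obs_type", "year"), sep = "_", convert = TRUE)

toy_split$year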
pivot_wider() to widen your table based on a mixed-category observation¶We've successfully separated our dual variable into obs_type and year. However, the three observation types are still trapped in the same column! Let's fix this so that each observation type (and its values) becomes its own variable. We'll have to take a small step back before we can take a step forward: send the data back into a slightly wider format!
# Spread data from obs_type as its contents should be their own variables
gapminder_long <-
gapminder_long %>%
pivot_wider(., names_from = ..., values_from = ...)
# Take a peek at the wider table
head(gapminder_long)
# Verify its structure
str(gapminder_long)
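Here is the spreading step in miniature, on an invented three-row table (toy is made up, echoing the Afghanistan 1952 values from the target table above): each obs_type label becomes its own column, with value filling it in.

```r
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(tidyr))

# Made-up data: three observation types stacked in one column
toy <- tibble(country  = rep("Afghanistan", 3),
              year     = rep(1952, 3),
              obs_type = c("gdpPercap", "lifeExp", "pop"),
              value    = c(779.4453, 28.801, 8425333))

# Spread obs_type: its labels become column names, value fills them
toy_wide <-
  toy %>%
  pivot_wider(names_from = obs_type, values_from = value)

toy_wide
```

Three stacked rows collapse into one row with three new variables, which is exactly the 1/3-the-rows effect described below.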
select() to rearrange your columns!¶We now have 1/3 as many observations (rows) in our data frame because that data has been pivoted into 3 new columns! Just what we wanted. Nearly there! Let's review what we want as a final data table.
| continent | country | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| Asia | Afghanistan | 1952 | 28.801 | 8425333 | 779.4453 |
| Asia | Afghanistan | 1957 | 30.332 | 9240934 | 820.8530 |
| Asia | Afghanistan | 1962 | 31.997 | 10267083 | 853.1007 |
| Asia | Afghanistan | 1967 | 34.020 | 11537966 | 836.1971 |
| Asia | Afghanistan | 1972 | 36.088 | 13079460 | 739.9811 |
| Asia | Afghanistan | 1977 | 38.438 | 14880372 | 786.1134 |
Let's put the columns into that order now, and sort the data by country name rather than continent. Then we'll save it to file.
# Rearrange columns (optional)
# Don't forget some of our commands from lecture 2! ie 2.3.0 select()
gapminder_long <-
gapminder_long %>%
select(...) %>%
arrange(., country)
head(gapminder_long)
# Check the directory
getwd()
# Write our final data file
write_csv(x = ...,
file = "data/gapminder_long.csv",
col_names = TRUE)
ggplot¶Exploratory plots allow us to quickly answer questions about our data or to get an idea of trends and data distribution. This week the DataCamp assignment will have you making a few basic plots with the data that you format. We'll be discussing plots using the ggplot package in much more detail next lecture. For now, here is an overview for producing a basic line or scatterplot using ggplot.
ggplot(), aes(), and geom_*()¶To produce a ggplot object, you need to specify a minimum of 3 components:
ggplot(): this initializes a ggplot object and requires a dataset of some kind, which you set with the data argument.
aes(): this sets the details of how your data will be used. For example, the x and y axes are set here, if applicable. You can also define how data are grouped with arguments like colour, fill, and size, setting these using variable names from data.
geom_*(): the geometric object defines the type of plot you want to visualize. The simplest versions are geom_line() (line graphs), geom_bar() (bar charts), and geom_point() (scatterplots, dotplots, and bubble charts).
Let's filter some gapminder_long data to plot as a line graph and scatter plot. From the results we may be able to glean some insights!
# Load the repr package to change default plot width
library(repr)
# Adjust our plot window size according to the expected output
options(repr.plot.width=12, repr.plot.height=7)
# First filter gapminder_long
gapminder_long %>%
filter(continent == ...) %>% # Use only data from the continent of Asia
# pass the data to ggplot
ggplot(data = .)+
# Tell ggplot which columns to draw data from
aes(x=..., y = ..., colour = ...) +
# Define how you want the plot to be visualized
...
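As a reference for the three-component pattern, here is a complete, self-contained sketch on an invented mini dataset (toy stands in for the filtered gapminder_long; its values are made up):

```r
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(ggplot2))

# Made-up mini dataset standing in for the filtered gapminder_long
toy <- tibble(year    = rep(c(1952, 1957, 1962), 2),
              lifeExp = c(28.8, 30.3, 32.0, 43.4, 45.3, 47.2),
              country = rep(c("Afghanistan", "India"), each = 3))

p <- ggplot(data = toy) +              # 1) initialize with the data
  aes(x = year, y = lifeExp,
      colour = country) +              # 2) map columns to plot aesthetics
  geom_line()                          # 3) choose the geometric object

# p is a ggplot object; printing it draws one line per country
p
```

Swapping geom_line() for geom_point() turns the same mapping into a scatterplot, which is all the second exercise below changes.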
It looks like life expectancy is generally on the rise in countries throughout Asia. What about life expectancy across the world based on factors like GDP per capita (i.e., how rich a country is)?
# First filter gapminder_long
gapminder_long %>%
filter(year == 2007) %>% # Take a look at just data from our last snapshot year: 2007
# Pass the data to ggplot
ggplot(data = .)+
# Tell ggplot which columns to draw data from
aes(x=..., y = ..., colour = ...) +
# Define how you want the plot to be visualized
...
With just a few lines of code, we can see that the overall trend suggests life expectancy is positively correlated with GDP per capita. It's not always the case, judging from some African countries, but that's the general trend we see from the 2007 snapshot of data. Next week we'll learn even more about using ggplot to visualize our data and customize its presentation.
That's the end of our third class on R! You've made it through, and we've learned about the following:
ggplot
Soon after the end of this lecture, a homework assignment will be available for you in DataCamp. Your assignment is to complete two chapters from the Reshaping Data with tidyr course: Tidy data (950 points) and From wide to long and back (1300 points). This is a pass-fail assignment; in order to pass, you need to achieve at least 1,687 points (75%) of the total possible points. Note that taking hints in a DataCamp chapter will reduce your total earned points for that chapter.
In order to properly assess your progress on DataCamp, at the end of each chapter, please take a screenshot of the entire course summary. You'll find this under the "Course Outline" menu bar at the top of the page for each course, and you'll want to expand each section. It should look something like this:
You may need to take several screenshots if you cannot capture it all in one. Submit the file(s) or a combined PDF to the assignment section of Quercus. By submitting your scores for each section and chapter, we can keep track of your progress, identify knowledge gaps, and give you a standardized way to check your assignment "grades" throughout the course.
You will have until 13:59 hours on Thursday, October 7th to submit your assignment (right before the next lecture).
Revision 1.0.0: materials prepared in R Markdown by Oscar Montoya, M.Sc. Bioinformatician, Education and Outreach, CAGEF.
Revision 1.1.0: edited and prepared in Jupyter Notebook by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
This class is supported by DataCamp, the most intuitive learning platform for data science and analytics. Learn any time, anywhere and become an expert in R, Python, SQL, and more. DataCamp’s learn-by-doing methodology combines short expert videos and hands-on-the-keyboard exercises to help learners retain knowledge. DataCamp offers 350+ courses by expert instructors on topics such as importing data, data visualization, and machine learning. They’re constantly expanding their curriculum to keep up with the latest technology trends and to provide the best learning experience for all skill levels. Join over 6 million learners around the world and close your skills gap.
Your DataCamp academic subscription grants you free access to DataCamp's catalog for 6 months from the beginning of this course. You are free to look for additional tutorials and courses to help grow your skills for your data science journey. Learn more (literally!) at DataCamp.com.